Name: Huimiao Chen, JHED ID: hchen185, Email: hchen185@jhu.edu.

(1) Look at the chapter on interactive graphics and, specifically, the code to display a subject's MRICloud data as a sunburst plot. Do the following. Display this subject's data as a Sankey diagram. Display as many levels as you can (at least 3) for Type = 1, starting from the intracranial volume. Put this in a file called hw4.ipynb.

In [1]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import random


# load data and pre-process

## data urls
url_1 = "https://raw.githubusercontent.com/smart-stats/ds4bio_book/main/book/assetts/kirby21AllLevels.csv"
url_2 = "https://raw.githubusercontent.com/bcaffo/MRIcloudT1volumetrics/master/inst/extdata/multilevel_lookup_table.txt"

## load in the hierarchy information
multilevel_lookup = pd.read_csv(url_2, sep = "\t").drop(['Level5'], axis = 1)
multilevel_lookup = multilevel_lookup.rename(columns = {
    "modify"   : "roi", 
    "modify.1" : "level4",
    "modify.2" : "level3", 
    "modify.3" : "level2",
    "modify.4" : "level1"})
multilevel_lookup = multilevel_lookup[['roi', 'level4', 'level3', 'level2', 'level1']]

## load in the subject data
id = 127
subjectData = pd.read_csv(url_1)
subjectData = subjectData.loc[(subjectData.type == 1) & (subjectData.level == 5) & (subjectData.id == id)]
subjectData = subjectData[['roi', 'volume']]

## merge the subject data with the multilevel data
subjectData = pd.merge(subjectData, multilevel_lookup, on = "roi")
subjectData = subjectData.assign(icv = "ICV")
subjectData = subjectData.assign(comp = subjectData.volume / np.sum(subjectData.volume))

## print data
print(subjectData)


# prepare data for Sankey diagram

## initialize an empty dictionary
data_dict = {}

## prepare node names and colors
data_dict["node"] = {"label": [], "color": []}
### prepare node names
cols = ["icv", "level1", "level2", "level3", "level4", "roi"]
node_matrix = subjectData.loc[:, cols].values
for i in range(len(node_matrix)): # add level names as prefixes to avoid same names from different levels
    for j, level in enumerate(cols):
        node_matrix[i][j] = level + ":" + node_matrix[i][j] # the prefixes will be deleted later 
node_names = node_matrix.flatten(order='F').tolist()
data_dict["node"]["label"] = list(set(node_names))
### prepare node colors
opacity_node = 0.8
colors = [(random.randint(0, 255), random.randint(0, 255), random.randint(0, 255), opacity_node) 
          for i in range(len(data_dict["node"]["label"]))] # generate RGBA color tuples with opacity
color_strings = ['rgba({}, {}, {}, {})'.format(r, g, b, a) 
                 for r, g, b, a in colors] # convert the RGBA color tuples to a list of RGBA color strings
data_dict["node"]["color"] = color_strings

## prepare link sources, targets, values, and colors
data_dict["link"] = {"source": [], "target": [], "value": [], "color": []}
label_No = {} # initialize node numbers dict
for index_, label_ in enumerate(data_dict["node"]["label"]):
    label_No[label_] = index_ # assign numbers to nodes
cols = ["icv", "level1", "level2", "level3", "level4", "roi", "comp"]
subjectData_new = subjectData.loc[:, cols]
for i in subjectData_new.index:
    for j in subjectData_new.columns[1:-1]: # exclusive of "icv" and "comp"
        source_name = cols[cols.index(j) - 1] + ":" + subjectData_new.loc[i, cols[cols.index(j) - 1]]
        source_number = label_No[source_name]
        target_name = j + ":" + subjectData_new.loc[i, j]
        target_number = label_No[target_name]
        value = subjectData_new.loc[:, cols].groupby(j).sum().loc[subjectData_new.loc[i, j],"comp"]
        if_break = False # begin check whether the data has been stored
        for a, b, c in zip(data_dict["link"]["source"], data_dict["link"]["target"], data_dict["link"]["value"]):
            if a == source_number and b == target_number and c == value:
                if_break = True
        if if_break:
            pass
        else:
            data_dict["link"]["source"].append(source_number)
            data_dict["link"]["target"].append(target_number)
            data_dict["link"]["value"].append(value)
opacity_link = 0.4
data_dict["link"]["color"] = [data_dict["node"]["color"][src].replace(str(opacity_node), str(opacity_link))
                                    for src in data_dict["link"]["source"]] 
for index_, label_ in enumerate(data_dict["node"]["label"]): # the prefixes of node names are deleted
    data_dict["node"]["label"][index_] = label_.split(":")[1]


# plot Sankey diagram

fig = go.Figure(data=[go.Sankey(
    # valueformat = ".0000f",
    valuesuffix = "(comp)",
    # Define nodes
    node = dict(
      pad = 15,
      thickness = 15,
      line = dict(color = "black", width = 0.5),
      label =  data_dict['node']['label'],
      color =  data_dict['node']['color']
    ),
    # Add links
    link = dict(
      source =  data_dict['link']['source'],
      target =  data_dict['link']['target'],
      value =  data_dict['link']['value'],
      color =  data_dict['link']['color']
))])

fig.update_layout(title_text="Structure of the intracranial volume<br>Source: Multi-level MRICloud data: <a href='https://raw.githubusercontent.com/smart-stats/ds4bio_book/main/book/assetts/kirby21AllLevels.csv'>kirby21AllLevels</a>",
                  font_size=10)
fig.show()
# Note: The ipynb file will not display the interactive graphics unless you rerun the code.
# The interactive graphics can be accessed via the urls in Q2. 
                roi  volume                      level4              level3  \
0             SFG_L   12926                       SFG_L           Frontal_L   
1             SFG_R   10050                       SFG_R           Frontal_R   
2         SFG_PFC_L   12783                       SFG_L           Frontal_L   
3         SFG_PFC_R   11507                       SFG_R           Frontal_R   
4        SFG_pole_L    3078                       SFG_L           Frontal_L   
..              ...     ...                         ...                 ...   
275  Chroid_LVetc_L     444  AnteriorLateralVentricle_L  LateralVentricle_L   
276  Chroid_LVetc_R     371  AnteriorLateralVentricle_R  LateralVentricle_R   
277    IV_ventricle    2700                IV_ventricle        IV_ventricle   
278          ECCL_L     292                  inf_DPWM_L        InferiorWM_L   
279          ECCL_R     292                  inf_DPWM_R        InferiorWM_R   

               level2           level1  icv      comp  
0    CerebralCortex_L  Telencephalon_L  ICV  0.009350  
1    CerebralCortex_R  Telencephalon_R  ICV  0.007270  
2    CerebralCortex_L  Telencephalon_L  ICV  0.009247  
3    CerebralCortex_R  Telencephalon_R  ICV  0.008324  
4    CerebralCortex_L  Telencephalon_L  ICV  0.002227  
..                ...              ...  ...       ...  
275        Ventricle               CSF  ICV  0.000321  
276        Ventricle               CSF  ICV  0.000268  
277         Ventricle              CSF  ICV  0.001953  
278     WhiteMatter_L  Telencephalon_L  ICV  0.000211  
279     WhiteMatter_R  Telencephalon_R  ICV  0.000211  

[280 rows x 8 columns]

(2) Create a simple webpage containing this graphic and host it on github pages. -Do not- host this off of your assignment repo from github classroom, since this is not public. Instead, you'll have to create a new public repo from your regular github account and add this file. Put the link to your live web page in a markdown cell of your hw4.ipynb file. Note, an easy way to create a webpage with this graphic is to export an ipynb as an html file.

In [2]:
# method 1: generate an html file using the folloing code 
# method 2: directly save the .ipynb file as a html file,
# which will keep the figure drawn in Q1 interactive.

fig.write_html(file="HW4_SankeyDiagram.html")

The html files generated by the two methods mentioned in the last chunk can be accessed in the following two urls from my public github repo:

  • https://huimiaochen.github.io/DS4PH-Spring2023-HW4/HW4_SankeyDiagram.html;

  • https://huimiaochen.github.io/DS4PH-Spring2023-HW4/HW4_JupyterNotebook.html (needs to scroll down).

(3) Create the opioid sqlite database from https://smart-stats.github.io/ds4bio_book/book/_build/html/sqlite.html. However, only go to the step where the csv files are read into the database. Then exit sqlite and you should have a file opioid.db that has the data. Next, read the three tables into pandas dataframes and do the data wrangling from the sqlite chapter directly in pandas. Add the python code to your hw4.ipynb file.

After downloading the three csv files into the working directory, I run the following commands in the prompt/command line to obtain the required opioid.db file, which is the input data of the code in the next chunk.

jupyter-hchen185@ip-172-31-29-195:~/HW4$ sqlite3 opioid.db
SQLite version 3.36.0 2021-06-18 18:36:39
Enter ".help" for usage hints.
sqlite> .mode csv
sqlite> .import county_pop_arcos.csv population
sqlite> .import county_annual.csv annual
sqlite> .import land_area.csv land
sqlite> .quit
In [3]:
import sqlite3 as sq3
import pandas as pd


# read data into pandas

## creat sql connection
con = sq3.connect("opioid.db")

# read the three tables into pandas dataframes
population = pd.read_sql_query("SELECT * from population", con)
annual = pd.read_sql_query("SELECT * from annual", con)
land = pd.read_sql_query("SELECT * from land", con)

## close sql connection
con.close


# print data heads
print("---*-data preview starts-*---")
print("-*-The following is the head of the population table-*-")
print(population.head())
print("---*---*---")
print("-*-The following is the head of the annual table-*-")
print(annual.head())
print("---*---*---")
print("-*-The following is the head of the land table-*-")
print(land.head())
print("---*-data preview ends-*---")


# data wrangling

print("---*-data wrangling starts-*---")

## print out a few columns of the population table
data_select = population.loc[:, ["BUYER_COUNTY", "BUYER_STATE", "STATE", "COUNTY", "year", "population"]]
print("-*-The following is the head of a few columns in the population table-*-")
print(data_select.head(5))
print("---*---*---")

## missing data in the annual table
data_miss = annual.loc[annual.countyfips == "NA"]
print("-*-The following is the head of missing data in the annual table-*-")
print(data_miss.head(10))
print("---*---*---")

## missing data in the annual table other than Puerto Rico (PR)
data_miss_noPR = annual.loc[(annual.countyfips == "NA") & (annual.BUYER_STATE != "PR")]
print("-*-The following is the head of missing data in the annual table other than Puerto Rico (PR)-*-")
print(data_miss_noPR.head(10))
print("---*---*---")

## set the Montgometry countyfips to the correct value 05097 and ignore other missing values
print("-*-The following is the data of Montgomery county AR in the annual table-*-")
print(annual[(annual.BUYER_STATE == "AR") & (annual.BUYER_COUNTY == "MONTGOMERY")])
annual.loc[(annual.BUYER_STATE == "AR") & (annual.BUYER_COUNTY == "MONTGOMERY"), "countyfips"] = 5097 # update "annual"
print("-*-The following is the data of Montgomery county AR in the revised annual table-*-")
print(annual[(annual.BUYER_STATE == "AR") & (annual.BUYER_COUNTY == "MONTGOMERY")])
print("---*---*---")

## delete rows from the annual table that have missing county data
print("-*-The following are the rows with BUYER_COUNTY==NA in the annual table-*-")
print(annual[annual.BUYER_COUNTY=="NA"])
annual.drop(annual[annual.BUYER_COUNTY=="NA"].index, inplace=True)
print("-*-The following are the rows with BUYER_COUNTY==NA in the revised annual table-*-")
print(annual[annual.BUYER_COUNTY=="NA"]) # expected to be empty
print("---*---*---")

## grab three columns from the land table and rename the column STCOU to coutyfips
land_area = land.loc[:, ["Areaname","STCOU","LND110210D"]]
land_area.rename(columns={"STCOU":"countyfips"}, inplace=True)
print("-*-The following are the head of the land_area table-*-")
print(land_area.head())
print("---*---*---")

## join the population and land_area tables to create a new table county_info
county_info = population.merge(land_area, how="left", on="countyfips")
print("-*-The following are the head of the county_info table-*-")
print(county_info.head())
print("---*---*---")

## print out the counts to make sure we accounted correctly
print("the count of the land table is:", land.shape[0])
print("the count of the land_area table is:", land_area.shape[0])
print("the count of the county_info table is:", county_info.shape[0])
print("the count of the population table is:", population.shape[0])
---*-data preview starts-*---
-*-The following is the head of the population table-*-
     BUYER_COUNTY BUYER_STATE countyfips STATE COUNTY county_name  \
0  1      AUTAUGA          AL      01001     1      1     Autauga   
1  2      BALDWIN          AL      01003     1      3     Baldwin   
2  3      BARBOUR          AL      01005     1      5     Barbour   
3  4         BIBB          AL      01007     1      7        Bibb   
4  5       BLOUNT          AL      01009     1      9      Blount   

                      NAME    variable  year population  
0  Autauga County, Alabama  B01003_001  2006      51328  
1  Baldwin County, Alabama  B01003_001  2006     168121  
2  Barbour County, Alabama  B01003_001  2006      27861  
3     Bibb County, Alabama  B01003_001  2006      22099  
4   Blount County, Alabama  B01003_001  2006      55485  
---*---*---
-*-The following is the head of the annual table-*-
     BUYER_COUNTY BUYER_STATE  year count DOSAGE_UNIT countyfips
0  1    ABBEVILLE          SC  2006   877      363620      45001
1  2    ABBEVILLE          SC  2007   908      402940      45001
2  3    ABBEVILLE          SC  2008   871      424590      45001
3  4    ABBEVILLE          SC  2009   930      467230      45001
4  5    ABBEVILLE          SC  2010  1197      539280      45001
---*---*---
-*-The following is the head of the land table-*-
           Areaname  STCOU LND010190F  LND010190D LND010190N1 LND010190N2  \
0  1  UNITED STATES  00000          0  3787425.08        0000        0000   
1  2        ALABAMA  01000          0    52422.94        0000        0000   
2  3    Autauga, AL  01001          0      604.49        0000        0000   
3  4    Baldwin, AL  01003          0     2027.08        0000        0000   
4  5    Barbour, AL  01005          0      904.59        0000        0000   

  LND010200F  LND010200D LND010200N1  ... LND110210N1 LND110210N2 LND210190F  \
0          0  3794083.06        0000  ...        0000        0000          0   
1          0    52419.02        0000  ...        0000        0000          0   
2          0      604.45        0000  ...        0000        0000          0   
3          0     2026.93        0000  ...        0000        0000          0   
4          0      904.52        0000  ...        0000        0000          0   

  LND210190D LND210190N1 LND210190N2 LND210200F LND210200D LND210200N1  \
0  251083.35        0000        0000          0  256644.62        0000   
1    1672.71        0000        0000          0    1675.01        0000   
2       8.48        0000        0000          0       8.48        0000   
3     430.55        0000        0000          0     430.58        0000   
4      19.59        0000        0000          0      19.61        0000   

  LND210200N2  
0        0000  
1        0000  
2        0000  
3        0000  
4        0000  

[5 rows x 35 columns]
---*-data preview ends-*---
---*-data wrangling starts-*---
-*-The following is the head of a few columns in the population table-*-
  BUYER_COUNTY BUYER_STATE STATE COUNTY  year population
0      AUTAUGA          AL     1      1  2006      51328
1      BALDWIN          AL     1      3  2006     168121
2      BARBOUR          AL     1      5  2006      27861
3         BIBB          AL     1      7  2006      22099
4       BLOUNT          AL     1      9  2006      55485
---*---*---
-*-The following is the head of missing data in the annual table-*-
         BUYER_COUNTY BUYER_STATE  year count DOSAGE_UNIT countyfips
187  188     ADJUNTAS          PR  2006   147      102800         NA
188  189     ADJUNTAS          PR  2007   153      104800         NA
189  190     ADJUNTAS          PR  2008   153       45400         NA
190  191     ADJUNTAS          PR  2009   184       54200         NA
191  192     ADJUNTAS          PR  2010   190       56200         NA
192  193     ADJUNTAS          PR  2011   186       65530         NA
193  194     ADJUNTAS          PR  2012   138       57330         NA
194  195     ADJUNTAS          PR  2013   138       65820         NA
195  196     ADJUNTAS          PR  2014    90       59490         NA
196  197       AGUADA          PR  2006   160       49200         NA
---*---*---
-*-The following is the head of missing data in the annual table other than Puerto Rico (PR)-*-
             BUYER_COUNTY BUYER_STATE  year count DOSAGE_UNIT countyfips
10071  10072         GUAM          GU  2006   319      265348         NA
10072  10073         GUAM          GU  2007   330      275600         NA
10073  10074         GUAM          GU  2008   313      286900         NA
10074  10075         GUAM          GU  2009   390      355300         NA
10075  10076         GUAM          GU  2010   510      413800         NA
10076  10077         GUAM          GU  2011   559      475600         NA
10077  10078         GUAM          GU  2012   616      564800         NA
10078  10079         GUAM          GU  2013   728      623200         NA
10079  10080         GUAM          GU  2014   712      558960         NA
17429  17430   MONTGOMERY          AR  2006   469      175390         NA
---*---*---
-*-The following is the data of Montgomery county AR in the annual table-*-
             BUYER_COUNTY BUYER_STATE  year count DOSAGE_UNIT countyfips
17429  17430   MONTGOMERY          AR  2006   469      175390         NA
17430  17431   MONTGOMERY          AR  2007   597      241270         NA
17431  17432   MONTGOMERY          AR  2008   561      251760         NA
17432  17433   MONTGOMERY          AR  2009   554      244160         NA
17433  17434   MONTGOMERY          AR  2010   449      247990         NA
17434  17435   MONTGOMERY          AR  2011   560      313800         NA
17435  17436   MONTGOMERY          AR  2012   696      339520         NA
17436  17437   MONTGOMERY          AR  2013   703      382300         NA
17437  17438   MONTGOMERY          AR  2014   491      396900         NA
-*-The following is the data of Montgomery county AR in the revised annual table-*-
             BUYER_COUNTY BUYER_STATE  year count DOSAGE_UNIT countyfips
17429  17430   MONTGOMERY          AR  2006   469      175390       5097
17430  17431   MONTGOMERY          AR  2007   597      241270       5097
17431  17432   MONTGOMERY          AR  2008   561      251760       5097
17432  17433   MONTGOMERY          AR  2009   554      244160       5097
17433  17434   MONTGOMERY          AR  2010   449      247990       5097
17434  17435   MONTGOMERY          AR  2011   560      313800       5097
17435  17436   MONTGOMERY          AR  2012   696      339520       5097
17436  17437   MONTGOMERY          AR  2013   703      382300       5097
17437  17438   MONTGOMERY          AR  2014   491      396900       5097
---*---*---
-*-The following are the rows with BUYER_COUNTY==NA in the annual table-*-
             BUYER_COUNTY BUYER_STATE  year count DOSAGE_UNIT countyfips
27741  27742           NA          AE  2006     2         330         NA
27742  27743           NA          CA  2006    47       12600         NA
27743  27744           NA          CT  2006   305       78700         NA
27744  27745           NA          CT  2007   112       30900         NA
27745  27746           NA          CT  2008    48       15000         NA
27746  27747           NA          FL  2006     9         900         NA
27747  27748           NA          FL  2007     7         700         NA
27748  27749           NA          GA  2006   114       51700         NA
27749  27750           NA          IA  2006     7        2300         NA
27750  27751           NA          IN  2006   292       39300         NA
27751  27752           NA          MA  2006   247      114900         NA
27752  27753           NA          NV  2006   380      173600         NA
27753  27754           NA          NV  2007   447      200600         NA
27754  27755           NA          NV  2008     5        2200         NA
27755  27756           NA          OH  2006    23        5100         NA
27756  27757           NA          PR  2006    10       17800         NA
27757  27758           NA          PR  2007     2        1300         NA
-*-The following are the rows with BUYER_COUNTY==NA in the revised annual table-*-
Empty DataFrame
Columns: [, BUYER_COUNTY, BUYER_STATE, year, count, DOSAGE_UNIT, countyfips]
Index: []
---*---*---
-*-The following are the head of the land_area table-*-
        Areaname countyfips  LND110210D
0  UNITED STATES      00000  3531905.43
1        ALABAMA      01000    50645.33
2    Autauga, AL      01001      594.44
3    Baldwin, AL      01003     1589.78
4    Barbour, AL      01005      884.88
---*---*---
-*-The following are the head of the county_info table-*-
     BUYER_COUNTY BUYER_STATE countyfips STATE COUNTY county_name  \
0  1      AUTAUGA          AL      01001     1      1     Autauga   
1  2      BALDWIN          AL      01003     1      3     Baldwin   
2  3      BARBOUR          AL      01005     1      5     Barbour   
3  4         BIBB          AL      01007     1      7        Bibb   
4  5       BLOUNT          AL      01009     1      9      Blount   

                      NAME    variable  year population     Areaname  \
0  Autauga County, Alabama  B01003_001  2006      51328  Autauga, AL   
1  Baldwin County, Alabama  B01003_001  2006     168121  Baldwin, AL   
2  Barbour County, Alabama  B01003_001  2006      27861  Barbour, AL   
3     Bibb County, Alabama  B01003_001  2006      22099     Bibb, AL   
4   Blount County, Alabama  B01003_001  2006      55485   Blount, AL   

  LND110210D  
0     594.44  
1    1589.78  
2     884.88  
3     622.58  
4     644.78  
---*---*---
the count of the land table is: 3198
the count of the land_area table is: 3198
the count of the county_info table is: 28265
the count of the population table is: 28265

(4) Create an interactive scatter plot of average number of opiod pills by year plot using plotly. See the example here. Don't do the intervals (little vertical lines), only the points. Add your plot to an html file with your repo for your Sanky diagram and host it publicly. Put a link to your hosted file in a markdown cell of your hw4.ipynb file. Note, an easy way to create a webpage with this graphic is to export an ipynb as an html file.

In [4]:
import plotly.express as px

# convert data into float point numbers
annual["DOSAGE_UNIT"] = annual["DOSAGE_UNIT"].astype(float) # some original data in this column is not float

# calculate the average numbers
data_scatter = annual.groupby('year').mean()

# plot the interactive scatter plot of average number of opiod pills by year plot
fig = px.scatter(data_scatter)
fig.update_layout(title_text="Interactive scatter plot of average number of opiod pills by year", font_size=12)
fig.show()

# wirte into the html file

fig.write_html(file="HW4_ScatterPlot.html")

The html files generated for the interactive scatter plot can be accessed in the following two urls from my public github repo (one is generated by wirte_html and the other is generated by exporting the ipynb as an html file):

  • https://huimiaochen.github.io/DS4PH-Spring2023-HW4/HW4_ScatterPlot.html;

  • https://huimiaochen.github.io/DS4PH-Spring2023-HW4/HW4_JupyterNotebook.html (needs to scroll down).

In [ ]: